Lethal Cancer Risk Factors in NHANES

Multidimensional Analysis of
Potential Cancer Mortality Risk Factors in the
National Health and Nutrition Examination Survey

Martin Skarzynski
Capstone Advisor: Professor Elizabeth Platz

2018-05-12

Context

Large epidemiologic studies,

Goals

  1. Explore the exposures measured in the Third NHANES (NHANES III) dataset
  2. Search for factors associated with cancer mortality
  3. Assess a variable selection method for cancer risk prediction models

Participants

NHANES III collected data on 33,994 participants aged 2 months and older from 1988 to 1994 in the United States, including Interview, Medical Examination, and Laboratory components.

The National Center for Health Statistics (NCHS) linked NHANES III data with mortality data from the National Death Index (NDI).

From the initial pool of participants, we selected 16,404 adult (\(age \geq 18\)) participants who

Among the 16,404 participants, there were 964 cancer deaths and 280,891 total years of follow-up.

Exposures

The initial publicly available dataset contained 3,544 exposures from the Interview, Medical Examination, Laboratory, and Mortality components.

We obtained the final set of 243 exposures, after removing variables that were

Data Processing

We modified the SAS code provided by NCHS to save the data as comma-separated-value (.csv) files, which are available on FigShare. The SAS code files (.sas) and analogous Jupyter Notebook files (.ipynb) are available on GitHub.

Methods

  1. We fit thousands of Cox proportional hazards models with and without ridge penalties to randomly selected subsets of up to 50 variables.
  2. We analyzed the descriptions of NHANES III variables provided by the National Center for Health Statistics (NCHS) and selected 3 variables known to be related to cancer risk (age, sex and race/ethnicity) to include in all future models.
  3. We then compared highly significant variables (p < 10-10) that appeared most frequently in the Cox models and selected 5 high-frequency highly significant variables that we used to train a new group of models with fewer randomized variables.

Modeling

Analysis of variables

The most frequent highly significant variables

Apply domain knowledge to the top variables

Name Median HR n Description
HSAGEIR 1.04 2016 Age in Years
HAQ1 1.07 1415 How would you describe the condition of your natural teeth (excellent, very good, good, fair or poor)?
HAT2 1.50 384 In the past month, did you jog or run?
HAT16 1.67 256 In the past month, did you lift weights?
HAB7 0.99 228 In the past 12 months, how many times were you in a nursing home?
DMAETHNR 1.14 224 Race/Ethnicity
HAK9 1.23 216 How many times per night do you usually get up to urinate?
HAP2 0.81 179 Do you use glasses, contacts, or both?
HAT10 1.43 119 In the past month, did you do other dancing?
HAR1 0.63 96 Have you smoked at least 100 cigarettes during your entire life?

Results Summary

  1. The models with fewer randomized variables outperformed the fully randomized models in terms of concordance.
  2. Across all of the models, the ten variables that most frequently surpassed the p-value threshold were
    • age,
    • race/ethnicity,
    • lifetime consumption of more than 100 cigarettes,
    • 3 variables that pertain to physical activity, and
    • 3 variables that may be related to aging.

Conclusions

  1. The work described here constitutes an exploratory analysis of the NHANES III dataset that employs an iterative strategy for the generation of cancer risk prediction models.
  2. Looking beyond this demonstration of a variable selection method, our ultimate goal is to build upon previously-described cancer risk factors towards
    • the discovery of novel contributors to cancer risk,
    • a deeper understanding of cancer etiology, and
    • an improved ability to predict cancer incidence and mortality.